1 Introduction

SeqKit is a cross-platform ultra-fast comprehensive toolkit for FASTA/Q processing. It works on Linux, macOS and Windows, and can be directly used without any dependencies or pre-configuration. Further information can be found on the project website and technical details are available in the research article.

3 Tutorial

In this tutorial we are going to learn how to use SeqKit to process and manipulate FASTA/Q files. The first thing to do is create a directory to store all the tutorial data. It is good practice to create a new directory for each project you work on, this ensures files do not get mixed up and all the results are self-contained.

Create a ‘tutorial’ directory to store output files:

bash
mkdir tutorial

Download the tutorial and exercise data:

bash
curl https://raw.githubusercontent.com/zifornd/bioinformatics-workshop/main/data/formats/data.tar.gz --output tutorial/data.tar.gz
##   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
##                                  Dload  Upload   Total   Spent    Left  Speed
## 
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0 13.2M    0  114k    0     0   104k      0  0:02:09  0:00:01  0:02:08  105k
  2 13.2M    2  387k    0     0   184k      0  0:01:13  0:00:02  0:01:11  184k
  3 13.2M    3  522k    0     0   168k      0  0:01:20  0:00:03  0:01:17  168k
  5 13.2M    5  719k    0     0   174k      0  0:01:17  0:00:04  0:01:13  175k
  6 13.2M    6  879k    0     0   172k      0  0:01:18  0:00:05  0:01:13  176k
  8 13.2M    8 1135k    0     0   186k      0  0:01:12  0:00:06  0:01:06  204k
 10 13.2M   10 1455k    0     0   205k      0  0:01:06  0:00:07  0:00:59  214k
 13 13.2M   13 1791k    0     0   220k      0  0:01:01  0:00:08  0:00:53  253k
 14 13.2M   14 1903k    0     0   208k      0  0:01:04  0:00:09  0:00:55  236k
 14 13.2M   14 1983k    0     0   195k      0  0:01:09  0:00:10  0:00:59  219k
 15 13.2M   15 2047k    0     0   184k      0  0:01:13  0:00:11  0:01:02  181k
 16 13.2M   16 2175k    0     0   179k      0  0:01:15  0:00:12  0:01:03  144k
 17 13.2M   17 2335k    0     0   178k      0  0:01:15  0:00:13  0:01:02  109k
 19 13.2M   19 2639k    0     0   186k      0  0:01:12  0:00:14  0:00:58  145k
 20 13.2M   20 2815k    0     0   186k      0  0:01:12  0:00:15  0:00:57  167k
 21 13.2M   21 2975k    0     0   184k      0  0:01:13  0:00:16  0:00:57  185k
 23 13.2M   23 3215k    0     0   188k      0  0:01:12  0:00:17  0:00:55  208k
 27 13.2M   27 3743k    0     0   206k      0  0:01:05  0:00:18  0:00:47  281k
 31 13.2M   31 4287k    0     0   224k      0  0:01:00  0:00:19  0:00:41  331k
 32 13.2M   32 4447k    0     0   221k      0  0:01:01  0:00:20  0:00:41  325k
 33 13.2M   33 4559k    0     0   216k      0  0:01:02  0:00:21  0:00:41  317k
 34 13.2M   34 4655k    0     0   210k      0  0:01:04  0:00:22  0:00:42  287k
 35 13.2M   35 4767k    0     0   206k      0  0:01:05  0:00:23  0:00:42  205k
 37 13.2M   37 5023k    0     0   208k      0  0:01:04  0:00:24  0:00:40  148k
 38 13.2M   38 5247k    0     0   209k      0  0:01:04  0:00:25  0:00:39  160k
 40 13.2M   40 5439k    0     0   208k      0  0:01:05  0:00:26  0:00:39  174k
 41 13.2M   41 5615k    0     0   207k      0  0:01:05  0:00:27  0:00:38  192k
 42 13.2M   42 5807k    0     0   206k      0  0:01:05  0:00:28  0:00:37  207k
 45 13.2M   45 6191k    0     0   212k      0  0:01:03  0:00:29  0:00:34  233k
 48 13.2M   48 6575k    0     0   218k      0  0:01:02  0:00:30  0:00:32  265k
 51 13.2M   51 6911k    0     0   222k      0  0:01:00  0:00:31  0:00:29  296k
 51 13.2M   51 7039k    0     0   219k      0  0:01:01  0:00:32  0:00:29  284k
 52 13.2M   52 7167k    0     0   216k      0  0:01:02  0:00:33  0:00:29  272k
 55 13.2M   55 7471k    0     0   219k      0  0:01:01  0:00:34  0:00:27  256k
 57 13.2M   57 7807k    0     0   222k      0  0:01:00  0:00:35  0:00:25  246k
 59 13.2M   59 7999k    0     0   221k      0  0:01:01  0:00:36  0:00:25  217k
 59 13.2M   59 8031k    0     0   216k      0  0:01:02  0:00:37  0:00:25  198k
 59 13.2M   59 8095k    0     0   212k      0  0:01:03  0:00:38  0:00:25  184k
 61 13.2M   61 8287k    0     0   211k      0  0:01:03  0:00:39  0:00:24  163k
 62 13.2M   62 8431k    0     0   210k      0  0:01:04  0:00:40  0:00:24  124k
 62 13.2M   62 8511k    0     0   207k      0  0:01:05  0:00:41  0:00:24  101k
 63 13.2M   63 8607k    0     0   204k      0  0:01:06  0:00:42  0:00:24  115k
 64 13.2M   64 8703k    0     0   201k      0  0:01:07  0:00:43  0:00:24  121k
 65 13.2M   65 8863k    0     0   200k      0  0:01:07  0:00:44  0:00:23  115k
 67 13.2M   67 9119k    0     0   202k      0  0:01:07  0:00:45  0:00:22  137k
 68 13.2M   68 9279k    0     0   201k      0  0:01:07  0:00:46  0:00:21  154k
 69 13.2M   69 9455k    0     0   200k      0  0:01:07  0:00:47  0:00:20  169k
 71 13.2M   71 9631k    0     0   200k      0  0:01:07  0:00:48  0:00:19  186k
 72 13.2M   72 9839k    0     0   200k      0  0:01:07  0:00:49  0:00:18  193k
 73 13.2M   73  9.7M    0     0   199k      0  0:01:07  0:00:50  0:00:17  179k
 76 13.2M   76 10.0M    0     0   201k      0  0:01:07  0:00:51  0:00:16  204k
 78 13.2M   78 10.4M    0     0   205k      0  0:01:06  0:00:52  0:00:14  246k
 81 13.2M   81 10.8M    0     0   208k      0  0:01:04  0:00:53  0:00:11  287k
 84 13.2M   84 11.2M    0     0   212k      0  0:01:03  0:00:54  0:00:09  336k
 90 13.2M   90 11.9M    0     0   221k      0  0:01:01  0:00:55  0:00:06  433k
 94 13.2M   94 12.4M    0     0   228k      0  0:00:59  0:00:56  0:00:03  499k
100 13.2M  100 13.2M    0     0   237k      0  0:00:56  0:00:56 --:--:--  590k

Extract the archive file into the tutorial directory:

bash
tar xf tutorial/data.tar.gz --directory=tutorial

3.1 Install SeqKit

The software we are going to use in this tutorial can be installed using the conda package manager. Please refer to the previous conda workshop for details on installing software and creating conda environments.

Create a new environment with SeqKit installed:

bash
conda create --yes --name seqkit seqkit
## Collecting package metadata (current_repodata.json): ...working... done
## Solving environment: ...working... done
## 
## ## Package Plan ##
## 
##   environment location: /opt/miniconda3/envs/seqkit
## 
##   added / updated specs:
##     - seqkit
## 
## 
## The following NEW packages will be INSTALLED:
## 
##   seqkit             bioconda/osx-64::seqkit-2.3.1-h527b516_0
## 
## 
## Preparing transaction: ...working... done
## Verifying transaction: ...working... done
## Executing transaction: ...working... done
## #
## # To activate this environment, use
## #
## #     $ conda activate seqkit
## #
## # To deactivate an active environment, use
## #
## #     $ conda deactivate
## 
## Retrieving notices: ...working... done

Activate the new environment to use it:

bash
conda activate seqkit

Test that the seqkit command is available:

bash
which seqkit
## /opt/miniconda3/envs/seqkit/bin/seqkit

3.2 Perform basic operations

We will start the tutorial by performing some basic operations on FASTA/Q files. The most popular operations include transforming sequences, calculating statistics, and extracting sub-sequences.

3.2.1 Transform sequences

Display help information for seq subcommand:

bash
seqkit seq --help
## transform sequences (extract ID, filter by length, remove gaps, reverse complement...)
## 
## Usage:
##   seqkit seq [flags]
## 
## Flags:
##   -k, --color                     colorize sequences - to be piped into "less -R"
##   -p, --complement                complement sequence, flag '-v' is recommended to switch on
##       --dna2rna                   DNA to RNA
##   -G, --gap-letters string        gap letters (default "- \t.")
##   -h, --help                      help for seq
##   -l, --lower-case                print sequences in lower case
##   -M, --max-len int               only print sequences shorter than the maximum length (-1 for no limit) (default -1)
##   -R, --max-qual float            only print sequences with average quality less than this limit (-1 for no limit) (default -1)
##   -m, --min-len int               only print sequences longer than the minimum length (-1 for no limit) (default -1)
##   -Q, --min-qual float            only print sequences with average quality qreater or equal than this limit (-1 for no limit) (default -1)
##   -n, --name                      only print names
##   -i, --only-id                   print ID instead of full head
##   -q, --qual                      only print qualities
##   -b, --qual-ascii-base int       ASCII BASE, 33 for Phred+33 (default 33)
##   -g, --remove-gaps               remove gaps
##   -r, --reverse                   reverse sequence
##       --rna2dna                   RNA to DNA
##   -s, --seq                       only print sequences
##   -u, --upper-case                print sequences in upper case
##   -v, --validate-seq              validate bases according to the alphabet
##   -V, --validate-seq-length int   length of sequence to validate (0 for whole seq) (default 10000)
## 
## Global Flags:
##       --alphabet-guess-seq-length int   length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
##       --id-ncbi                         FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
##       --id-regexp string                regular expression for parsing ID (default "^(\\S+)\\s?")
##       --infile-list string              file of input files list (one file per line), if given, they are appended to files from cli arguments
##   -w, --line-width int                  line width when outputting FASTA format (0 for no wrap) (default 60)
##   -o, --out-file string                 out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
##       --quiet                           be quiet and do not show extra information
##   -t, --seq-type string                 sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")
##   -j, --threads int                     number of CPUs. can also set with environment variable SEQKIT_THREADS) (default 4)

Only print sequence names:

bash
seqkit seq --name tutorial/data/example/genes.fasta
## GENE-1
## GENE-2
## GENE-3
## GENE-4
## GENE-5
## GENE-6
## GENE-7
## GENE-8
## GENE-9
## GENE-10

Only print sequences:

bash
seqkit seq --seq tutorial/data/example/genes.fasta
## ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAACTAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTGTTGCAGCCGATCATCAGCACATCTAGGTTTCGTCCGGGTGTGACCGAAAGGTAAG
## ATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTGCCTGTTTTACAGGTTCGCGACGTGCTCGTACGTGGCTTTGGAGACTCCGTGGAGGAGGTCTTATCAGAGGCACGTCAACATCTTAAAGATGGCACTTGTGGCTTAGTAGAAGTTGAAAAAGGCGTTTTGCCTCAACTTGAACAGCCCTATGTGTTCATCAAACGTTCGGATGCTCGAA
## ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTCTCTAGTCAGTGTGTTAATCTTACAACCAGAACTCAATTACCCCCTGCATACACTAATTCTTTCACACGTGGTGTTTATTACCCTGACAAAGTTTTCAGATCCTCAGTTTTACATTCAACTCAGGACTTGTTCTTACCTTTCTTTTCCAATGTTACTTGGTTCCATGCTATACATGTCTCTGGGACCAATGGTACTAAGAGGTTTGATAACCCTGTCCTACCATTTAATGATGGTGTTTATTTTGCTTCCACTGAGAAGTCTAACATAATAAGAGGCTGGATTTTTGGTACTACTTTAGATTC
## ATGGATTTGTTTATGAGAATCTTCACAATTGGAACTGTAACTTTGAAGCAAGGTGAAATCAAGGATGCTACTCCTTCAGATTTTGTTCGCGCTACTGCAACGATACCGATACAAGCCTCACTCCCTTTCGGATGGCTTATTGTTGGCGTTGCACTTCTTGCTG
## ATGTACTCATTCGTTTCGGAAGAGACAGGTACGTTAATAGTTAATAGCGTACTTCTTTTTCTTGCTTTCGTGGTATTCTTGC
## ATGGCAGATTCCAACGGTACTATTACCGTTGAAGAGCTTAAAAAGCTCCTTGAACAATGGAACCTAGTAATAGGTTTCCTATTCCTTACATGGATTTGTCTTCTACAATTTGCCTATGCCAACAGGAATAGGTTTTTGTATATAATTAAGTTAATTTTCCTCTGGCTGT
## ATGTTTCATCTCGTTGACTTTCAGGTTACTATAGCAGAGATATTACTAATTATTATGAGGACTTTTAAAGTTTCCAT
## ATGAAAATTATTCTTTTCTTGGCACTGATAACACTCGCTACTTGTGAGCTTTATCACTACCAAGAGTGTGTTAGAGGTACAACAGTACTTTTAAAAGAACCTTGCTCTTCTGGAACATACGAGGGCAATTCACCATTTCATCCTCTAGCTGATAACAAATTTGCACTGACTTGCTTTAGCACTCA
## ATGATTGAACTTTCATTAATTGACTTCTATTTGTGCTTTTTAGCCTTTCTGCTATTCCTTGTTTTAATTATGCTTATTATCTTTTGGTTCTCACTTGAACTG
## ATGAAATTTCTTGTTTTCTTAGGAATCATCACAACTGTAGCTGCATTTCACCAAGAATGTAGTTTACAGTCATGTACTCAACATCAACCATATGTAGTTGATGACCCGTGTCCTATTCACTTCTATTCTAAATGGTATATTAGAGTAGGAGCTAGAAAATC

Filter by sequence length:

bash
# Only print sequences shorter than 250 characters
seqkit seq --max-len 250 tutorial/data/example/genes.fasta
## [WARN] you may switch on flag -g/--remove-gaps to remove spaces
## >GENE-2
## ATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTGCCTGTT
## TTACAGGTTCGCGACGTGCTCGTACGTGGCTTTGGAGACTCCGTGGAGGAGGTCTTATCA
## GAGGCACGTCAACATCTTAAAGATGGCACTTGTGGCTTAGTAGAAGTTGAAAAAGGCGTT
## TTGCCTCAACTTGAACAGCCCTATGTGTTCATCAAACGTTCGGATGCTCGAA
## >GENE-4
## ATGGATTTGTTTATGAGAATCTTCACAATTGGAACTGTAACTTTGAAGCAAGGTGAAATC
## AAGGATGCTACTCCTTCAGATTTTGTTCGCGCTACTGCAACGATACCGATACAAGCCTCA
## CTCCCTTTCGGATGGCTTATTGTTGGCGTTGCACTTCTTGCTG
## >GENE-5
## ATGTACTCATTCGTTTCGGAAGAGACAGGTACGTTAATAGTTAATAGCGTACTTCTTTTT
## CTTGCTTTCGTGGTATTCTTGC
## >GENE-6
## ATGGCAGATTCCAACGGTACTATTACCGTTGAAGAGCTTAAAAAGCTCCTTGAACAATGG
## AACCTAGTAATAGGTTTCCTATTCCTTACATGGATTTGTCTTCTACAATTTGCCTATGCC
## AACAGGAATAGGTTTTTGTATATAATTAAGTTAATTTTCCTCTGGCTGT
## >GENE-7
## ATGTTTCATCTCGTTGACTTTCAGGTTACTATAGCAGAGATATTACTAATTATTATGAGG
## ACTTTTAAAGTTTCCAT
## >GENE-8
## ATGAAAATTATTCTTTTCTTGGCACTGATAACACTCGCTACTTGTGAGCTTTATCACTAC
## CAAGAGTGTGTTAGAGGTACAACAGTACTTTTAAAAGAACCTTGCTCTTCTGGAACATAC
## GAGGGCAATTCACCATTTCATCCTCTAGCTGATAACAAATTTGCACTGACTTGCTTTAGC
## ACTCA
## >GENE-9
## ATGATTGAACTTTCATTAATTGACTTCTATTTGTGCTTTTTAGCCTTTCTGCTATTCCTT
## GTTTTAATTATGCTTATTATCTTTTGGTTCTCACTTGAACTG
## >GENE-10
## ATGAAATTTCTTGTTTTCTTAGGAATCATCACAACTGTAGCTGCATTTCACCAAGAATGT
## AGTTTACAGTCATGTACTCAACATCAACCATATGTAGTTGATGACCCGTGTCCTATTCAC
## TTCTATTCTAAATGGTATATTAGAGTAGGAGCTAGAAAATC

3.2.2 Calculate statistics

Display help information for stats subcommand:

bash
seqkit stats --help
## simple statistics of FASTA/Q files
## 
## Tips:
##   1. For lots of small files (especially on SDD), use big value of '-j' to
##      parallelize counting.
## 
## Usage:
##   seqkit stats [flags]
## 
## Aliases:
##   stats, stat
## 
## Flags:
##   -a, --all                  all statistics, including quartiles of seq length, sum_gap, N50
##   -b, --basename             only output basename of files
##   -E, --fq-encoding string   fastq quality encoding. available values: 'sanger', 'solexa', 'illumina-1.3+', 'illumina-1.5+', 'illumina-1.8+'. (default "sanger")
##   -G, --gap-letters string   gap letters (default "- .")
##   -h, --help                 help for stats
##   -e, --skip-err             skip error, only show warning message
##   -i, --stdin-label string   label for replacing default "-" for stdin (default "-")
##   -T, --tabular              output in machine-friendly tabular format
## 
## Global Flags:
##       --alphabet-guess-seq-length int   length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
##       --id-ncbi                         FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
##       --id-regexp string                regular expression for parsing ID (default "^(\\S+)\\s?")
##       --infile-list string              file of input files list (one file per line), if given, they are appended to files from cli arguments
##   -w, --line-width int                  line width when outputting FASTA format (0 for no wrap) (default 60)
##   -o, --out-file string                 out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
##       --quiet                           be quiet and do not show extra information
##   -t, --seq-type string                 sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")
##   -j, --threads int                     number of CPUs. can also set with environment variable SEQKIT_THREADS) (default 4)

Calculate simple statistics:

bash
seqkit stats tutorial/data/example/genes.fasta
## file                               format  type  num_seqs  sum_len  min_len  avg_len  max_len
## tutorial/data/example/genes.fasta  FASTA   DNA         10    1,771       77    177.1      335

Output simple statistics in tabular format:

bash
seqkit stats --tabular tutorial/data/example/genes.fasta
## file format  type    num_seqs    sum_len min_len avg_len max_len
## tutorial/data/example/genes.fasta    FASTA   DNA 10  1771    77  177.1   335

Calculate advanced statistics:

bash
seqkit stats --all tutorial/data/example/genes.fasta
## file                               format  type  num_seqs  sum_len  min_len  avg_len  max_len   Q1   Q2   Q3  sum_gap  N50  Q20(%)  Q30(%)  GC(%)
## tutorial/data/example/genes.fasta  FASTA   DNA         10    1,771       77    177.1      335  102  166  232        0  185       0       0   39.3

3.2.3 Extract subsequences

Display help information for subseq subcommand:

bash
seqkit subseq --help
## get subsequences by region/gtf/bed, including flanking sequences.
## 
## Attentions:
##   1. Use "seqkit grep" for extract subsets of sequences.
##      "seqtk subseq seqs.fasta id.txt" equals to
##      "seqkit grep -f id.txt seqs.fasta"
## 
## Recommendation:
##   1. use plain FASTA file, so seqkit could utilize FASTA index.
## 
## The definition of region is 1-based and with some custom design.
## 
## Examples:
## 
##  1-based index    1 2 3 4 5 6 7 8 9 10
## negative index    0-9-8-7-6-5-4-3-2-1
##            seq    A C G T N a c g t n
##            1:1    A
##            2:4      C G T
##          -4:-2                c g t
##          -4:-1                c g t n
##          -1:-1                      n
##           2:-2      C G T N a c g t
##           1:-1    A C G T N a c g t n
##           1:12    A C G T N a c g t n
##         -12:-1    A C G T N a c g t n
## 
## Usage:
##   seqkit subseq [flags]
## 
## Flags:
##       --bed string        by tab-delimited BED file
##       --chr strings       select limited sequence with sequence IDs when using --gtf or --bed (multiple value supported, case ignored)
##   -d, --down-stream int   down stream length
##       --feature strings   select limited feature types (multiple value supported, case ignored, only works with GTF)
##       --gtf string        by GTF (version 2.2) file
##       --gtf-tag string    output this tag as sequence comment (default "gene_id")
##   -h, --help              help for subseq
##   -f, --only-flank        only return up/down stream sequence
##   -r, --region string     by region. e.g 1:12 for first 12 bases, -12:-1 for last 12 bases, 13:-1 for cutting first 12 bases. type "seqkit subseq -h" for more examples
##   -u, --up-stream int     up stream length
## 
## Global Flags:
##       --alphabet-guess-seq-length int   length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
##       --id-ncbi                         FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
##       --id-regexp string                regular expression for parsing ID (default "^(\\S+)\\s?")
##       --infile-list string              file of input files list (one file per line), if given, they are appended to files from cli arguments
##   -w, --line-width int                  line width when outputting FASTA format (0 for no wrap) (default 60)
##   -o, --out-file string                 out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
##       --quiet                           be quiet and do not show extra information
##   -t, --seq-type string                 sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")
##   -j, --threads int                     number of CPUs. can also set with environment variable SEQKIT_THREADS) (default 4)

Get the first 10 bases from each sequence:

bash
seqkit subseq --region 1:10 tutorial/data/example/genes.fasta
## [INFO] create FASTA index for tutorial/data/example/genes.fasta
## >GENE-1
## ATTAAAGGTT
## >GENE-2
## ATGGAGAGCC
## >GENE-3
## ATGTTTGTTT
## >GENE-4
## ATGGATTTGT
## >GENE-5
## ATGTACTCAT
## >GENE-6
## ATGGCAGATT
## >GENE-7
## ATGTTTCATC
## >GENE-8
## ATGAAAATTA
## >GENE-9
## ATGATTGAAC
## >GENE-10
## ATGAAATTTC

Get the last 10 bases from each sequence:

bash
seqkit subseq --region -10:-1 tutorial/data/example/genes.fasta
## >GENE-1
## GAAAGGTAAG
## >GENE-2
## GATGCTCGAA
## >GENE-3
## CTTTAGATTC
## >GENE-4
## CTTCTTGCTG
## >GENE-5
## GTATTCTTGC
## >GENE-6
## CTCTGGCTGT
## >GENE-7
## AAGTTTCCAT
## >GENE-8
## TTAGCACTCA
## >GENE-9
## ACTTGAACTG
## >GENE-10
## CTAGAAAATC

3.3 Convert between file formats

Next, we will look at how to convert between the FASTA/Q file formats. Remember, it only makes sense to convert from FASTQ to FASTA and not the other way around.

3.3.1 FASTQ to FASTA

Display help information for fq2fa subcommand:

bash
seqkit fq2fa --help
## convert FASTQ to FASTA
## 
## Usage:
##   seqkit fq2fa [flags]
## 
## Flags:
##   -h, --help   help for fq2fa
## 
## Global Flags:
##       --alphabet-guess-seq-length int   length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
##       --id-ncbi                         FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
##       --id-regexp string                regular expression for parsing ID (default "^(\\S+)\\s?")
##       --infile-list string              file of input files list (one file per line), if given, they are appended to files from cli arguments
##   -w, --line-width int                  line width when outputting FASTA format (0 for no wrap) (default 60)
##   -o, --out-file string                 out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
##       --quiet                           be quiet and do not show extra information
##   -t, --seq-type string                 sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")
##   -j, --threads int                     number of CPUs. can also set with environment variable SEQKIT_THREADS) (default 4)

Convert FASTQ to FASTA:

bash
seqkit fq2fa tutorial/data/example/reads.fastq
## >READ-1
## CCTAAGAATACTGGAGAAATTTAGCTGGAAAGATAC
## >READ-2
## ACCATCCTGTGGTGAGGTAGCGGGTTCTCTGCGTCTCT
## >READ-3
## AGATTCTGCAAAGGAATGTCCAGCATCTGAAAATAA
## >READ-4
## CTGCTTCAAGGATGAATTCAGGGACCTTCAGACCCTGT
## >READ-5
## GTGGGTGGATTGATGGAAGTCCCTTGGCTGGCTGGCTG
## >READ-6
## ACATCTTTACGAGTTATATCCTTCTGCAAACCAACAT
## >READ-7
## ATCATTTTTGTAGAAAATCTTCTACCACATGTTAGAC
## >READ-8
## ACAGGTGGCATTTAGCATCTTACTATCCTGAAAGAACT
## >READ-9
## TGCTGGGTGGTCCTGAAGGAACCCCCCAGCTCTCTGCT
## >READ-10
## AAAAAGTGGAAAATTTAGAAATGTCCACTGTAGGACTT

3.3.2 FASTA/Q to TSV

Display help information for fx2tab subcommand:

bash
seqkit fx2tab --help
## convert FASTA/Q to tabular format, and provide various information,
## like sequence length, GC content/GC skew.
## 
## Attention:
##   1. Fixed three columns (ID, sequence, quality) are outputted for either FASTA
##      or FASTQ, except when flag -n/--name is on. This is for format compatibility.
## 
## Usage:
##   seqkit fx2tab [flags]
## 
## Flags:
##   -a, --alphabet               print alphabet letters
##   -q, --avg-qual               print average quality of a read
##   -B, --base-content strings   print base content. (case ignored, multiple values supported) e.g. -B AT -B N
##   -C, --base-count strings     print base count. (case ignored, multiple values supported) e.g. -C AT -C N
##   -I, --case-sensitive         calculate case sensitive base content/sequence hash
##   -g, --gc                     print GC content
##   -G, --gc-skew                print GC-Skew
##   -H, --header-line            print header line
##   -h, --help                   help for fx2tab
##   -l, --length                 print sequence length
##   -n, --name                   only print names (no sequences and qualities)
##   -Q, --no-qual                only output two column even for FASTQ file
##   -i, --only-id                print ID instead of full head
##   -b, --qual-ascii-base int    ASCII BASE, 33 for Phred+33 (default 33)
##   -s, --seq-hash               print hash (MD5) of sequence
## 
## Global Flags:
##       --alphabet-guess-seq-length int   length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
##       --id-ncbi                         FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
##       --id-regexp string                regular expression for parsing ID (default "^(\\S+)\\s?")
##       --infile-list string              file of input files list (one file per line), if given, they are appended to files from cli arguments
##   -w, --line-width int                  line width when outputting FASTA format (0 for no wrap) (default 60)
##   -o, --out-file string                 out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
##       --quiet                           be quiet and do not show extra information
##   -t, --seq-type string                 sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")
##   -j, --threads int                     number of CPUs. can also set with environment variable SEQKIT_THREADS) (default 4)

Output FASTQ in tabular format:

bash
seqkit fx2tab tutorial/data/example/reads.fastq
## READ-1   CCTAAGAATACTGGAGAAATTTAGCTGGAAAGATAC    AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
## READ-2   ACCATCCTGTGGTGAGGTAGCGGGTTCTCTGCGTCTCT  AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
## READ-3   AGATTCTGCAAAGGAATGTCCAGCATCTGAAAATAA    AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
## READ-4   CTGCTTCAAGGATGAATTCAGGGACCTTCAGACCCTGT  AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEE
## READ-5   GTGGGTGGATTGATGGAAGTCCCTTGGCTGGCTGGCTG  AAAAAEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEE
## READ-6   ACATCTTTACGAGTTATATCCTTCTGCAAACCAACAT   AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEE
## READ-7   ATCATTTTTGTAGAAAATCTTCTACCACATGTTAGAC   AAAAAEEAEEEEEEEE6EEEEEEEEEEEEEEEE/EEE
## READ-8   ACAGGTGGCATTTAGCATCTTACTATCCTGAAAGAACT  AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
## READ-9   TGCTGGGTGGTCCTGAAGGAACCCCCCAGCTCTCTGCT  AAAAAEEEEEEAEEE6EEEA6EEEEEE/E/EEAEAEAA
## READ-10  AAAAAGTGGAAAATTTAGAAATGTCCACTGTAGGACTT  AAAAAEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEE

Output specific FASTQ information in tabular format:

bash
# print sequence name, length, and GC content
seqkit fx2tab --name --length --gc tutorial/data/example/reads.fastq
## READ-1   36  36.11
## READ-2   38  57.89
## READ-3   36  36.11
## READ-4   38  50.00
## READ-5   38  60.53
## READ-6   37  35.14
## READ-7   37  29.73
## READ-8   38  39.47
## READ-9   38  63.16
## READ-10  38  31.58

3.4 Search for sub-sequences

Now we will look at how to search for small sub-sequences within the sequences stored in FASTA/Q files.

3.4.1 Search sequence

Display help information for grep subcommand:

bash
seqkit grep --help
## search sequences by ID/name/sequence/sequence motifs, mismatch allowed
## 
## Attentions:
## 
##   0. By default, we match sequence ID with patterns, use "-n/--by-name"
##      for matching full name instead of just ID.
##   1. Unlike POSIX/GNU grep, we compare the pattern to the whole target
##      (ID/full header) by default. Please switch "-r/--use-regexp" on
##      for partly matching.
##   2. When searching by sequences, it's partly matching, and both positive
##      and negative strands are searched.
##      Mismatch is allowed using flag "-m/--max-mismatch", you can increase
##      the value of "-j/--threads" to accelerate processing.
##   3. Degenerate bases/residues like "RYMM.." are also supported by flag -d.
##      But do not use degenerate bases/residues in regular expression, you need
##      convert them to regular expression, e.g., change "N" or "X"  to ".".
##   4. When providing search patterns (motifs) via flag '-p',
##      please use double quotation marks for patterns containing comma, 
##      e.g., -p '"A{2,}"' or -p "\"A{2,}\"". Because the command line argument
##      parser accepts comma-separated-values (CSV) for multiple values (motifs).
##      Patterns in file do not follow this rule.
##   5. The order of sequences in result is consistent with that in original
##      file, not the order of the query patterns. 
##      But for FASTA file, you can use:
##         seqkit faidx seqs.fasta --infile-list IDs.txt
##   6. For multiple patterns, you can either set "-p" multiple times, i.e.,
##      -p pattern1 -p pattern2, or give a file of patterns via "-f/--pattern-file".
## 
## You can specify the sequence region for searching with the flag -R (--region).
## The definition of region is 1-based and with some custom design.
## 
## Examples:
## 
##  1-based index    1 2 3 4 5 6 7 8 9 10
## negative index    0-9-8-7-6-5-4-3-2-1
##            seq    A C G T N a c g t n
##            1:1    A
##            2:4      C G T
##          -4:-2                c g t
##          -4:-1                c g t n
##          -1:-1                      n
##           2:-2      C G T N a c g t
##           1:-1    A C G T N a c g t n
##           1:12    A C G T N a c g t n
##         -12:-1    A C G T N a c g t n
## 
## Usage:
##   seqkit grep [flags]
## 
## Flags:
##   -n, --by-name                match by full name instead of just ID
##   -s, --by-seq                 search subseq on seq, both positive and negative strand are searched, and mismatch allowed using flag -m/--max-mismatch
##   -c, --circular               circular genome
##   -C, --count                  just print a count of matching records. with the -v/--invert-match flag, count non-matching records
##   -d, --degenerate             pattern/motif contains degenerate base
##       --delete-matched         delete a pattern right after being matched, this keeps the firstly matched data and speedups when using regular expressions
##   -h, --help                   help for grep
##   -i, --ignore-case            ignore case
##   -I, --immediate-output       print output immediately, do not use write buffer
##   -v, --invert-match           invert the sense of matching, to select non-matching records
##   -m, --max-mismatch int       max mismatch when matching by seq. For large genomes like human genome, using mapping/alignment tools would be faster
##   -P, --only-positive-strand   only search on positive strand
##   -p, --pattern strings        search pattern (multiple values supported. Attention: use double quotation marks for patterns containing comma, e.g., -p '"A{2,}"'))
##   -f, --pattern-file string    pattern file (one record per line)
##   -R, --region string          specify sequence region for searching. e.g 1:12 for first 12 bases, -12:-1 for last 12 bases
##   -r, --use-regexp             patterns are regular expression
## 
## Global Flags:
##       --alphabet-guess-seq-length int   length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
##       --id-ncbi                         FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
##       --id-regexp string                regular expression for parsing ID (default "^(\\S+)\\s?")
##       --infile-list string              file of input files list (one file per line), if given, they are appended to files from cli arguments
##   -w, --line-width int                  line width when outputting FASTA format (0 for no wrap) (default 60)
##   -o, --out-file string                 out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
##       --quiet                           be quiet and do not show extra information
##   -t, --seq-type string                 sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")
##   -j, --threads int                     number of CPUs. can also set with environment variable SEQKIT_THREADS) (default 4)

Extract sequences with matching name:

bash
seqkit grep --by-name --pattern "gene-GU280_gp04" tutorial/data/example/genes.fasta

Extract sequences containing AGGCG subsequence:

bash
seqkit grep --by-seq --pattern "AGGCG" tutorial/data/example/genes.fasta
## >GENE-2
## ATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTGCCTGTT
## TTACAGGTTCGCGACGTGCTCGTACGTGGCTTTGGAGACTCCGTGGAGGAGGTCTTATCA
## GAGGCACGTCAACATCTTAAAGATGGCACTTGTGGCTTAGTAGAAGTTGAAAAAGGCGTT
## TTGCCTCAACTTGAACAGCCCTATGTGTTCATCAAACGTTCGGATGCTCGAA

Extract sequences containing AGGCG in the first N bases:

bash
# sequence region for searching (1:10 for first 10 bases)
seqkit grep --by-seq --pattern "AGGCG" --region 1:10 tutorial/data/example/genes.fasta

3.4.2 Locate subsequence

Display help information for locate subcommand:

bash
seqkit locate --help
## locate subsequences/motifs, mismatch allowed
## 
## Attentions:
## 
##   1. Motifs could be EITHER plain sequence containing "ACTGN" OR regular
##      expression like "A[TU]G(?:.{3})+?[TU](?:AG|AA|GA)" for ORFs.     
##   2. Degenerate bases/residues like "RYMM.." are also supported by flag -d.
##      But do not use degenerate bases/residues in regular expression, you need
##      convert them to regular expression, e.g., change "N" or "X"  to ".".
##   3. When providing search patterns (motifs) via flag '-p',
##      please use double quotation marks for patterns containing comma, 
##      e.g., -p '"A{2,}"' or -p "\"A{2,}\"". Because the command line argument
##      parser accepts comma-separated-values (CSV) for multiple values (motifs).
##      Patterns in file do not follow this rule.     
##   4. Mismatch is allowed using flag "-m/--max-mismatch",
##      you can increase the value of "-j/--threads" to accelerate processing.
##   5. When using flag --circular, end position of matched subsequence that 
##      crossing genome sequence end would be greater than sequence length.
## 
## Usage:
##   seqkit locate [flags]
## 
## Flags:
##       --bed                       output in BED6 format
##   -c, --circular                  circular genome. type "seqkit locate -h" for details
##   -d, --degenerate                pattern/motif contains degenerate base
##       --gtf                       output in GTF format
##   -h, --help                      help for locate
##   -M, --hide-matched              do not show matched sequences
##   -i, --ignore-case               ignore case
##   -I, --immediate-output          print output immediately, do not use write buffer
##   -m, --max-mismatch int          max mismatch when matching by seq. For large genomes like human genome, using mapping/alignment tools would be faster
##   -G, --non-greedy                non-greedy mode, faster but may miss motifs overlapping with others
##   -P, --only-positive-strand      only search on positive strand
##   -p, --pattern strings           pattern/motif (multiple values supported. Attention: use double quotation marks for patterns containing comma, e.g., -p '"A{2,}"')
##   -f, --pattern-file string       pattern/motif file (FASTA format)
##   -F, --use-fmi                   use FM-index for much faster search of lots of sequence patterns
##   -r, --use-regexp                patterns/motifs are regular expression
##   -V, --validate-seq-length int   length of sequence to validate (0 for whole seq) (default 10000)
## 
## Global Flags:
##       --alphabet-guess-seq-length int   length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
##       --id-ncbi                         FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
##       --id-regexp string                regular expression for parsing ID (default "^(\\S+)\\s?")
##       --infile-list string              file of input files list (one file per line), if given, they are appended to files from cli arguments
##   -w, --line-width int                  line width when outputting FASTA format (0 for no wrap) (default 60)
##   -o, --out-file string                 out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
##       --quiet                           be quiet and do not show extra information
##   -t, --seq-type string                 sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")
##   -j, --threads int                     number of CPUs. can also set with environment variable SEQKIT_THREADS) (default 4)

Locate a subsequence in each sequence:

bash
# exact match
seqkit locate --pattern "ACGT" tutorial/data/example/genes.fasta
## seqID    patternName pattern strand  start   end matched
## GENE-2   ACGT    ACGT    +   38  41  ACGT
## GENE-2   ACGT    ACGT    +   74  77  ACGT
## GENE-2   ACGT    ACGT    +   84  87  ACGT
## GENE-2   ACGT    ACGT    +   126 129 ACGT
## GENE-2   ACGT    ACGT    +   216 219 ACGT
## GENE-2   ACGT    ACGT    -   216 219 ACGT
## GENE-2   ACGT    ACGT    -   126 129 ACGT
## GENE-2   ACGT    ACGT    -   84  87  ACGT
## GENE-2   ACGT    ACGT    -   74  77  ACGT
## GENE-2   ACGT    ACGT    -   38  41  ACGT
## GENE-3   ACGT    ACGT    +   99  102 ACGT
## GENE-3   ACGT    ACGT    -   99  102 ACGT
## GENE-5   ACGT    ACGT    +   31  34  ACGT
## GENE-5   ACGT    ACGT    -   31  34  ACGT

Locate a subsequence with N mismatches:

bash
# allow 1 mismatch
seqkit locate --pattern "ACGT" --max-mismatch 1 tutorial/data/example/genes.fasta
## seqID    patternName pattern strand  start   end matched
## GENE-1   ACGT    ACGT    +   6   9   AGGT
## GENE-1   ACGT    ACGT    +   14  17  ACCT
## GENE-1   ACGT    ACGT    +   22  25  AGGT
## GENE-1   ACGT    ACGT    +   39  42  ACTT
## GENE-1   ACGT    ACGT    +   70  73  ACGA
## GENE-1   ACGT    ACGT    +   74  77  ACTT
## GENE-1   ACGT    ACGT    +   122 125 ACGC
## GENE-1   ACGT    ACGT    +   150 153 TCGT
## GENE-1   ACGT    ACGT    +   163 166 ACGA
## GENE-1   ACGT    ACGT    +   172 175 TCGT
## GENE-1   ACGT    ACGT    +   196 199 ACGG
## GENE-1   ACGT    ACGT    +   202 205 TCGT
## GENE-1   ACGT    ACGT    +   206 209 CCGT
## GENE-1   ACGT    ACGT    +   229 232 ACAT
## GENE-1   ACGT    ACGT    +   235 238 AGGT
## GENE-1   ACGT    ACGT    +   240 243 TCGT
## GENE-1   ACGT    ACGT    +   259 262 AGGT
## GENE-1   ACGT    ACGT    -   259 262 ACCT
## GENE-1   ACGT    ACGT    -   240 243 ACGA
## GENE-1   ACGT    ACGT    -   235 238 ACCT
## GENE-1   ACGT    ACGT    -   229 232 ATGT
## GENE-1   ACGT    ACGT    -   206 209 ACGG
## GENE-1   ACGT    ACGT    -   202 205 ACGA
## GENE-1   ACGT    ACGT    -   196 199 CCGT
## GENE-1   ACGT    ACGT    -   172 175 ACGA
## GENE-1   ACGT    ACGT    -   163 166 TCGT
## GENE-1   ACGT    ACGT    -   150 153 ACGA
## GENE-1   ACGT    ACGT    -   122 125 GCGT
## GENE-1   ACGT    ACGT    -   74  77  AAGT
## GENE-1   ACGT    ACGT    -   70  73  TCGT
## GENE-1   ACGT    ACGT    -   39  42  AAGT
## GENE-1   ACGT    ACGT    -   22  25  ACCT
## GENE-1   ACGT    ACGT    -   14  17  AGGT
## GENE-1   ACGT    ACGT    -   6   9   ACCT
## GENE-2   ACGT    ACGT    +   26  29  ACGA
## GENE-2   ACGT    ACGT    +   38  41  ACGT
## GENE-2   ACGT    ACGT    +   65  68  AGGT
## GENE-2   ACGT    ACGT    +   74  77  ACGT
## GENE-2   ACGT    ACGT    +   80  83  TCGT
## GENE-2   ACGT    ACGT    +   84  87  ACGT
## GENE-2   ACGT    ACGT    +   101 104 CCGT
## GENE-2   ACGT    ACGT    +   110 113 AGGT
## GENE-2   ACGT    ACGT    +   126 129 ACGT
## GENE-2   ACGT    ACGT    +   132 135 ACAT
## GENE-2   ACGT    ACGT    +   148 151 ACTT
## GENE-2   ACGT    ACGT    +   164 167 AAGT
## GENE-2   ACGT    ACGT    +   176 179 GCGT
## GENE-2   ACGT    ACGT    +   189 192 ACTT
## GENE-2   ACGT    ACGT    +   203 206 ATGT
## GENE-2   ACGT    ACGT    +   216 219 ACGT
## GENE-2   ACGT    ACGT    -   216 219 ACGT
## GENE-2   ACGT    ACGT    -   203 206 ACAT
## GENE-2   ACGT    ACGT    -   189 192 AAGT
## GENE-2   ACGT    ACGT    -   176 179 ACGC
## GENE-2   ACGT    ACGT    -   164 167 ACTT
## GENE-2   ACGT    ACGT    -   148 151 AAGT
## GENE-2   ACGT    ACGT    -   132 135 ATGT
## GENE-2   ACGT    ACGT    -   126 129 ACGT
## GENE-2   ACGT    ACGT    -   110 113 ACCT
## GENE-2   ACGT    ACGT    -   101 104 ACGG
## GENE-2   ACGT    ACGT    -   84  87  ACGT
## GENE-2   ACGT    ACGT    -   80  83  ACGA
## GENE-2   ACGT    ACGT    -   74  77  ACGT
## GENE-2   ACGT    ACGT    -   65  68  ACCT
## GENE-2   ACGT    ACGT    -   38  41  ACGT
## GENE-2   ACGT    ACGT    -   26  29  TCGT
## GENE-3   ACGT    ACGT    +   1   4   ATGT
## GENE-3   ACGT    ACGT    +   99  102 ACGT
## GENE-3   ACGT    ACGT    +   122 125 AAGT
## GENE-3   ACGT    ACGT    +   144 147 ACAT
## GENE-3   ACGT    ACGT    +   158 161 ACTT
## GENE-3   ACGT    ACGT    +   168 171 ACCT
## GENE-3   ACGT    ACGT    +   182 185 ATGT
## GENE-3   ACGT    ACGT    +   187 190 ACTT
## GENE-3   ACGT    ACGT    +   204 207 ACAT
## GENE-3   ACGT    ACGT    +   206 209 ATGT
## GENE-3   ACGT    ACGT    +   232 235 AGGT
## GENE-3   ACGT    ACGT    +   289 292 AAGT
## GENE-3   ACGT    ACGT    +   296 299 ACAT
## GENE-3   ACGT    ACGT    +   325 328 ACTT
## GENE-3   ACGT    ACGT    -   325 328 AAGT
## GENE-3   ACGT    ACGT    -   296 299 ATGT
## GENE-3   ACGT    ACGT    -   289 292 ACTT
## GENE-3   ACGT    ACGT    -   232 235 ACCT
## GENE-3   ACGT    ACGT    -   206 209 ACAT
## GENE-3   ACGT    ACGT    -   204 207 ATGT
## GENE-3   ACGT    ACGT    -   187 190 AAGT
## GENE-3   ACGT    ACGT    -   182 185 ACAT
## GENE-3   ACGT    ACGT    -   168 171 AGGT
## GENE-3   ACGT    ACGT    -   158 161 AAGT
## GENE-3   ACGT    ACGT    -   144 147 ATGT
## GENE-3   ACGT    ACGT    -   122 125 ACTT
## GENE-3   ACGT    ACGT    -   99  102 ACGT
## GENE-3   ACGT    ACGT    -   1   4   ACAT
## GENE-4   ACGT    ACGT    +   40  43  ACTT
## GENE-4   ACGT    ACGT    +   51  54  AGGT
## GENE-4   ACGT    ACGT    +   100 103 ACGA
## GENE-4   ACGT    ACGT    +   146 149 GCGT
## GENE-4   ACGT    ACGT    +   153 156 ACTT
## GENE-4   ACGT    ACGT    -   153 156 AAGT
## GENE-4   ACGT    ACGT    -   146 149 ACGC
## GENE-4   ACGT    ACGT    -   100 103 TCGT
## GENE-4   ACGT    ACGT    -   51  54  ACCT
## GENE-4   ACGT    ACGT    -   40  43  AAGT
## GENE-5   ACGT    ACGT    +   1   4   ATGT
## GENE-5   ACGT    ACGT    +   11  14  TCGT
## GENE-5   ACGT    ACGT    +   27  30  AGGT
## GENE-5   ACGT    ACGT    +   31  34  ACGT
## GENE-5   ACGT    ACGT    +   47  50  GCGT
## GENE-5   ACGT    ACGT    +   51  54  ACTT
## GENE-5   ACGT    ACGT    +   68  71  TCGT
## GENE-5   ACGT    ACGT    -   68  71  ACGA
## GENE-5   ACGT    ACGT    -   51  54  AAGT
## GENE-5   ACGT    ACGT    -   47  50  ACGC
## GENE-5   ACGT    ACGT    -   31  34  ACGT
## GENE-5   ACGT    ACGT    -   27  30  ACCT
## GENE-5   ACGT    ACGT    -   11  14  ACGA
## GENE-5   ACGT    ACGT    -   1   4   ACAT
## GENE-6   ACGT    ACGT    +   14  17  ACGG
## GENE-6   ACGT    ACGT    +   26  29  CCGT
## GENE-6   ACGT    ACGT    +   62  65  ACCT
## GENE-6   ACGT    ACGT    +   72  75  AGGT
## GENE-6   ACGT    ACGT    +   88  91  ACAT
## GENE-6   ACGT    ACGT    +   130 133 AGGT
## GENE-6   ACGT    ACGT    +   148 151 AAGT
## GENE-6   ACGT    ACGT    -   148 151 ACTT
## GENE-6   ACGT    ACGT    -   130 133 ACCT
## GENE-6   ACGT    ACGT    -   88  91  ATGT
## GENE-6   ACGT    ACGT    -   72  75  ACCT
## GENE-6   ACGT    ACGT    -   62  65  AGGT
## GENE-6   ACGT    ACGT    -   26  29  ACGG
## GENE-6   ACGT    ACGT    -   14  17  CCGT
## GENE-7   ACGT    ACGT    +   1   4   ATGT
## GENE-7   ACGT    ACGT    +   11  14  TCGT
## GENE-7   ACGT    ACGT    +   17  20  ACTT
## GENE-7   ACGT    ACGT    +   23  26  AGGT
## GENE-7   ACGT    ACGT    +   61  64  ACTT
## GENE-7   ACGT    ACGT    +   68  71  AAGT
## GENE-7   ACGT    ACGT    -   68  71  ACTT
## GENE-7   ACGT    ACGT    -   61  64  AAGT
## GENE-7   ACGT    ACGT    -   23  26  ACCT
## GENE-7   ACGT    ACGT    -   17  20  AAGT
## GENE-7   ACGT    ACGT    -   11  14  ACGA
## GENE-7   ACGT    ACGT    -   1   4   ACAT
## GENE-8   ACGT    ACGT    +   40  43  ACTT
## GENE-8   ACGT    ACGT    +   75  78  AGGT
## GENE-8   ACGT    ACGT    +   87  90  ACTT
## GENE-8   ACGT    ACGT    +   99  102 ACCT
## GENE-8   ACGT    ACGT    +   115 118 ACAT
## GENE-8   ACGT    ACGT    +   119 122 ACGA
## GENE-8   ACGT    ACGT    +   169 172 ACTT
## GENE-8   ACGT    ACGT    -   169 172 AAGT
## GENE-8   ACGT    ACGT    -   119 122 TCGT
## GENE-8   ACGT    ACGT    -   115 118 ATGT
## GENE-8   ACGT    ACGT    -   99  102 AGGT
## GENE-8   ACGT    ACGT    -   87  90  AAGT
## GENE-8   ACGT    ACGT    -   75  78  ACCT
## GENE-8   ACGT    ACGT    -   40  43  AAGT
## GENE-9   ACGT    ACGT    +   9   12  ACTT
## GENE-9   ACGT    ACGT    +   23  26  ACTT
## GENE-9   ACGT    ACGT    +   93  96  ACTT
## GENE-9   ACGT    ACGT    -   93  96  AAGT
## GENE-9   ACGT    ACGT    -   23  26  AAGT
## GENE-9   ACGT    ACGT    -   9   12  AAGT
## GENE-10  ACGT    ACGT    +   57  60  ATGT
## GENE-10  ACGT    ACGT    +   72  75  ATGT
## GENE-10  ACGT    ACGT    +   81  84  ACAT
## GENE-10  ACGT    ACGT    +   92  95  ATGT
## GENE-10  ACGT    ACGT    +   106 109 CCGT
## GENE-10  ACGT    ACGT    +   119 122 ACTT
## GENE-10  ACGT    ACGT    -   119 122 AAGT
## GENE-10  ACGT    ACGT    -   106 109 ACGG
## GENE-10  ACGT    ACGT    -   92  95  ACAT
## GENE-10  ACGT    ACGT    -   81  84  ATGT
## GENE-10  ACGT    ACGT    -   72  75  ACAT
## GENE-10  ACGT    ACGT    -   57  60  ACAT

Locate a subsequence using regular expressions:

bash
# match AAGC or GGGC
seqkit locate --pattern "(A{2}|G{2})GC" --use-regexp tutorial/data/example/genes.fasta
## seqID    patternName pattern strand  start   end matched
## GENE-1   (A{2}|G{2})GC   (A{2}|G{2})GC   -   192 195 AAGC
## GENE-1   (A{2}|G{2})GC   (A{2}|G{2})GC   -   109 112 AAGC
## GENE-2   (A{2}|G{2})GC   (A{2}|G{2})GC   -   198 201 GGGC
## GENE-2   (A{2}|G{2})GC   (A{2}|G{2})GC   -   155 158 AAGC
## GENE-2   (A{2}|G{2})GC   (A{2}|G{2})GC   -   89  92  AAGC
## GENE-3   (A{2}|G{2})GC   (A{2}|G{2})GC   -   277 280 AAGC
## GENE-4   (A{2}|G{2})GC   (A{2}|G{2})GC   +   46  49  AAGC
## GENE-4   (A{2}|G{2})GC   (A{2}|G{2})GC   +   113 116 AAGC
## GENE-4   (A{2}|G{2})GC   (A{2}|G{2})GC   -   135 138 AAGC
## GENE-5   (A{2}|G{2})GC   (A{2}|G{2})GC   -   64  67  AAGC
## GENE-6   (A{2}|G{2})GC   (A{2}|G{2})GC   +   43  46  AAGC
## GENE-6   (A{2}|G{2})GC   (A{2}|G{2})GC   -   36  39  AAGC
## GENE-8   (A{2}|G{2})GC   (A{2}|G{2})GC   +   123 126 GGGC
## GENE-8   (A{2}|G{2})GC   (A{2}|G{2})GC   -   173 176 AAGC
## GENE-8   (A{2}|G{2})GC   (A{2}|G{2})GC   -   48  51  AAGC
## GENE-9   (A{2}|G{2})GC   (A{2}|G{2})GC   -   72  75  AAGC
## GENE-9   (A{2}|G{2})GC   (A{2}|G{2})GC   -   35  38  AAGC

3.5 Perform set operations

3.5.1 Sample sequences

Display help information for sample subcommand:

bash
seqkit sample --help
## sample sequences by number or proportion.
## 
## Attention:
## 1. Do not use '-n' on large FASTQ files, it loads all seqs into memory!
##    use 'seqkit sample -p 0.1 seqs.fq.gz | seqkit head -n N' instead!
## 
## Usage:
##   seqkit sample [flags]
## 
## Flags:
##   -h, --help               help for sample
##   -n, --number int         sample by number (result may not exactly match), DO NOT use on large FASTQ files.
##   -p, --proportion float   sample by proportion
##   -s, --rand-seed int      rand seed (default 11)
##   -2, --two-pass           2-pass mode read files twice to lower memory usage. Not allowed when reading from stdin
## 
## Global Flags:
##       --alphabet-guess-seq-length int   length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
##       --id-ncbi                         FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
##       --id-regexp string                regular expression for parsing ID (default "^(\\S+)\\s?")
##       --infile-list string              file of input files list (one file per line), if given, they are appended to files from cli arguments
##   -w, --line-width int                  line width when outputting FASTA format (0 for no wrap) (default 60)
##   -o, --out-file string                 out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
##       --quiet                           be quiet and do not show extra information
##   -t, --seq-type string                 sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")
##   -j, --threads int                     number of CPUs. can also set with environment variable SEQKIT_THREADS) (default 4)

Sample sequences by number:

bash
# set seed to reproduce result
seqkit sample --number 5 --rand-seed 1701 tutorial/data/example/genes.fasta
## [INFO] sample by number
## [INFO] loading all sequences into memory...
## [INFO] 4 sequences outputted
## >GENE-2
## ATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTGCCTGTT
## TTACAGGTTCGCGACGTGCTCGTACGTGGCTTTGGAGACTCCGTGGAGGAGGTCTTATCA
## GAGGCACGTCAACATCTTAAAGATGGCACTTGTGGCTTAGTAGAAGTTGAAAAAGGCGTT
## TTGCCTCAACTTGAACAGCCCTATGTGTTCATCAAACGTTCGGATGCTCGAA
## >GENE-3
## ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTCTCTAGTCAGTGTGTTAATCTTACAACC
## AGAACTCAATTACCCCCTGCATACACTAATTCTTTCACACGTGGTGTTTATTACCCTGAC
## AAAGTTTTCAGATCCTCAGTTTTACATTCAACTCAGGACTTGTTCTTACCTTTCTTTTCC
## AATGTTACTTGGTTCCATGCTATACATGTCTCTGGGACCAATGGTACTAAGAGGTTTGAT
## AACCCTGTCCTACCATTTAATGATGGTGTTTATTTTGCTTCCACTGAGAAGTCTAACATA
## ATAAGAGGCTGGATTTTTGGTACTACTTTAGATTC
## >GENE-4
## ATGGATTTGTTTATGAGAATCTTCACAATTGGAACTGTAACTTTGAAGCAAGGTGAAATC
## AAGGATGCTACTCCTTCAGATTTTGTTCGCGCTACTGCAACGATACCGATACAAGCCTCA
## CTCCCTTTCGGATGGCTTATTGTTGGCGTTGCACTTCTTGCTG
## >GENE-10
## ATGAAATTTCTTGTTTTCTTAGGAATCATCACAACTGTAGCTGCATTTCACCAAGAATGT
## AGTTTACAGTCATGTACTCAACATCAACCATATGTAGTTGATGACCCGTGTCCTATTCAC
## TTCTATTCTAAATGGTATATTAGAGTAGGAGCTAGAAAATC

Sample sequences by proportion:

bash
# set seed to reproduce result
seqkit sample --proportion 0.25 --rand-seed 1701 tutorial/data/example/genes.fasta
## [INFO] sample by proportion
## [INFO] 3 sequences outputted
## >GENE-2
## ATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTGCCTGTT
## TTACAGGTTCGCGACGTGCTCGTACGTGGCTTTGGAGACTCCGTGGAGGAGGTCTTATCA
## GAGGCACGTCAACATCTTAAAGATGGCACTTGTGGCTTAGTAGAAGTTGAAAAAGGCGTT
## TTGCCTCAACTTGAACAGCCCTATGTGTTCATCAAACGTTCGGATGCTCGAA
## >GENE-3
## ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTCTCTAGTCAGTGTGTTAATCTTACAACC
## AGAACTCAATTACCCCCTGCATACACTAATTCTTTCACACGTGGTGTTTATTACCCTGAC
## AAAGTTTTCAGATCCTCAGTTTTACATTCAACTCAGGACTTGTTCTTACCTTTCTTTTCC
## AATGTTACTTGGTTCCATGCTATACATGTCTCTGGGACCAATGGTACTAAGAGGTTTGAT
## AACCCTGTCCTACCATTTAATGATGGTGTTTATTTTGCTTCCACTGAGAAGTCTAACATA
## ATAAGAGGCTGGATTTTTGGTACTACTTTAGATTC
## >GENE-10
## ATGAAATTTCTTGTTTTCTTAGGAATCATCACAACTGTAGCTGCATTTCACCAAGAATGT
## AGTTTACAGTCATGTACTCAACATCAACCATATGTAGTTGATGACCCGTGTCCTATTCAC
## TTCTATTCTAAATGGTATATTAGAGTAGGAGCTAGAAAATC

3.5.2 Split sequences

Display help information for split subcommand:

bash
seqkit split --help
## split sequences into files by name ID, subsequence of given region,
## part size or number of parts.
## 
## If you just want to split by parts or sizes, please use "seqkit split2",
## which also applies for paired- and single-end FASTQ.
## 
## The definition of region is 1-based and with some custom design.
## 
## Examples:
## 
##  1-based index    1 2 3 4 5 6 7 8 9 10
## negative index    0-9-8-7-6-5-4-3-2-1
##            seq    A C G T N a c g t n
##            1:1    A
##            2:4      C G T
##          -4:-2                c g t
##          -4:-1                c g t n
##          -1:-1                      n
##           2:-2      C G T N a c g t
##           1:-1    A C G T N a c g t n
##           1:12    A C G T N a c g t n
##         -12:-1    A C G T N a c g t n
## 
## Usage:
##   seqkit split [flags]
## 
## Flags:
##   -i, --by-id                     split squences according to sequence ID
##       --by-id-prefix string       file prefix for --by-id
##   -p, --by-part int               split sequences into N parts
##       --by-part-prefix string     file prefix for --by-part
##   -r, --by-region string          split squences according to subsequence of given region. e.g 1:12 for first 12 bases, -12:-1 for last 12 bases. type "seqkit split -h" for more examples
##       --by-region-prefix string   file prefix for --by-region
##   -s, --by-size int               split sequences into multi parts with N sequences
##       --by-size-prefix string     file prefix for --by-size
##   -d, --dry-run                   dry run, just print message and no files will be created.
##   -e, --extension string          set output file extension, e.g., ".gz", ".xz", or ".zst"
##   -f, --force                     overwrite output directory
##   -h, --help                      help for split
##   -k, --keep-temp                 keep temporary FASTA and .fai file when using 2-pass mode
##   -O, --out-dir string            output directory (default value is $infile.split)
##   -2, --two-pass                  two-pass mode read files twice to lower memory usage. (only for FASTA format)
## 
## Global Flags:
##       --alphabet-guess-seq-length int   length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
##       --id-ncbi                         FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
##       --id-regexp string                regular expression for parsing ID (default "^(\\S+)\\s?")
##       --infile-list string              file of input files list (one file per line), if given, they are appended to files from cli arguments
##   -w, --line-width int                  line width when outputting FASTA format (0 for no wrap) (default 60)
##   -o, --out-file string                 out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
##       --quiet                           be quiet and do not show extra information
##   -t, --seq-type string                 sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")
##   -j, --threads int                     number of CPUs. can also set with environment variable SEQKIT_THREADS) (default 4)

Split sequences into files by name:

bash
seqkit split --by-id tutorial/data/example/genes.fasta
## [INFO] split by ID. idRegexp: ^(\S+)\s?
## [INFO] read sequences ...
## [INFO] read 10 sequences
## [INFO] write 1 sequences to file: tutorial/data/example/genes.fasta.split/genes.part_GENE-3.fasta
## [INFO] write 1 sequences to file: tutorial/data/example/genes.fasta.split/genes.part_GENE-4.fasta
## [INFO] write 1 sequences to file: tutorial/data/example/genes.fasta.split/genes.part_GENE-6.fasta
## [INFO] write 1 sequences to file: tutorial/data/example/genes.fasta.split/genes.part_GENE-7.fasta
## [INFO] write 1 sequences to file: tutorial/data/example/genes.fasta.split/genes.part_GENE-8.fasta
## [INFO] write 1 sequences to file: tutorial/data/example/genes.fasta.split/genes.part_GENE-1.fasta
## [INFO] write 1 sequences to file: tutorial/data/example/genes.fasta.split/genes.part_GENE-5.fasta
## [INFO] write 1 sequences to file: tutorial/data/example/genes.fasta.split/genes.part_GENE-9.fasta
## [INFO] write 1 sequences to file: tutorial/data/example/genes.fasta.split/genes.part_GENE-10.fasta
## [INFO] write 1 sequences to file: tutorial/data/example/genes.fasta.split/genes.part_GENE-2.fasta

Split sequences into multiple files with N sequences:

bash
# 2 sequences per file
seqkit split --by-size 2 tutorial/data/example/genes.fasta
## [WARN] outdir not empty: tutorial/data/example/genes.fasta.split, you can use --force to overwrite
## [INFO] split into 2 seqs per file
## [INFO] write 2 sequences to file: tutorial/data/example/genes.fasta.split/genes.part_001.fasta
## [INFO] write 2 sequences to file: tutorial/data/example/genes.fasta.split/genes.part_002.fasta
## [INFO] write 2 sequences to file: tutorial/data/example/genes.fasta.split/genes.part_003.fasta
## [INFO] write 2 sequences to file: tutorial/data/example/genes.fasta.split/genes.part_004.fasta
## [INFO] write 2 sequences to file: tutorial/data/example/genes.fasta.split/genes.part_005.fasta

3.6 Edit sequence files

3.6.1 Replace name/sequence

Display help information for replace subcommand:

bash
seqkit replace --help
## replace name/sequence by regular expression.
## 
## Note that the replacement supports capture variables.
## e.g. $1 represents the text of the first submatch.
## ATTENTION: use SINGLE quote NOT double quotes in *nix OS.
## 
## Examples: Adding space to all bases.
## 
##     seqkit replace -p "(.)" -r '$1 ' -s
## 
## Or use the \ escape character.
## 
##     seqkit replace -p "(.)" -r "\$1 " -s
## 
## more on: http://bioinf.shenwei.me/seqkit/usage/#replace
## 
## Special replacement symbols (only for replacing name not sequence):
## 
##     {nr}    Record number, starting from 1
##     {kv}    Corresponding value of the key (captured variable $n) by key-value file,
##             n can be specified by flag -I (--key-capt-idx) (default: 1)
##             
## Special cases:
##   1. If replacements contain '$', 
##     a). If using '{kv}', you need use '$$$$' instead of a single '$':
##             -r '{kv}' -k <(sed 's/\$/$$$$/' kv.txt)
##     b). If not, use '$$':
##             -r 'xxx$$xx'
## 
## Usage:
##   seqkit replace [flags]
## 
## Flags:
##   -s, --by-seq                 replace seq (only FASTA)
##   -h, --help                   help for replace
##   -i, --ignore-case            ignore case
##   -K, --keep-key               keep the key as value when no value found for the key (only for sequence name)
##   -U, --keep-untouch           do not change anything when no value found for the key (only for sequence name)
##   -I, --key-capt-idx int       capture variable index of key (1-based) (default 1)
##   -m, --key-miss-repl string   replacement for key with no corresponding value
##   -k, --kv-file string         tab-delimited key-value file for replacing key with value when using "{kv}" in -r (--replacement) (only for sequence name)
##       --nr-width int           minimum width for {nr} in flag -r/--replacement. e.g., formatting "1" to "001" by --nr-width 3 (default 1)
##   -p, --pattern string         search regular expression
##   -r, --replacement string     replacement. supporting capture variables.  e.g. $1 represents the text of the first submatch. ATTENTION: for *nix OS, use SINGLE quote NOT double quotes or use the \ escape character. Record number is also supported by "{nr}".use ${1} instead of $1 when {kv} given!
## 
## Global Flags:
##       --alphabet-guess-seq-length int   length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
##       --id-ncbi                         FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
##       --id-regexp string                regular expression for parsing ID (default "^(\\S+)\\s?")
##       --infile-list string              file of input files list (one file per line), if given, they are appended to files from cli arguments
##   -w, --line-width int                  line width when outputting FASTA format (0 for no wrap) (default 60)
##   -o, --out-file string                 out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
##       --quiet                           be quiet and do not show extra information
##   -t, --seq-type string                 sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")
##   -j, --threads int                     number of CPUs. can also set with environment variable SEQKIT_THREADS) (default 4)

Replace name by regular expression:

bash
seqkit replace --pattern "-" --replacement "_" tutorial/data/example/genes.fasta
## >GENE_1
## ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCT
## GTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACT
## CACGCAGTATAATTAATAACTAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATC
## TTCTGCAGGCTGCTTACGGTTTCGTCCGTGTTGCAGCCGATCATCAGCACATCTAGGTTT
## CGTCCGGGTGTGACCGAAAGGTAAG
## >GENE_2
## ATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTGCCTGTT
## TTACAGGTTCGCGACGTGCTCGTACGTGGCTTTGGAGACTCCGTGGAGGAGGTCTTATCA
## GAGGCACGTCAACATCTTAAAGATGGCACTTGTGGCTTAGTAGAAGTTGAAAAAGGCGTT
## TTGCCTCAACTTGAACAGCCCTATGTGTTCATCAAACGTTCGGATGCTCGAA
## >GENE_3
## ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTCTCTAGTCAGTGTGTTAATCTTACAACC
## AGAACTCAATTACCCCCTGCATACACTAATTCTTTCACACGTGGTGTTTATTACCCTGAC
## AAAGTTTTCAGATCCTCAGTTTTACATTCAACTCAGGACTTGTTCTTACCTTTCTTTTCC
## AATGTTACTTGGTTCCATGCTATACATGTCTCTGGGACCAATGGTACTAAGAGGTTTGAT
## AACCCTGTCCTACCATTTAATGATGGTGTTTATTTTGCTTCCACTGAGAAGTCTAACATA
## ATAAGAGGCTGGATTTTTGGTACTACTTTAGATTC
## >GENE_4
## ATGGATTTGTTTATGAGAATCTTCACAATTGGAACTGTAACTTTGAAGCAAGGTGAAATC
## AAGGATGCTACTCCTTCAGATTTTGTTCGCGCTACTGCAACGATACCGATACAAGCCTCA
## CTCCCTTTCGGATGGCTTATTGTTGGCGTTGCACTTCTTGCTG
## >GENE_5
## ATGTACTCATTCGTTTCGGAAGAGACAGGTACGTTAATAGTTAATAGCGTACTTCTTTTT
## CTTGCTTTCGTGGTATTCTTGC
## >GENE_6
## ATGGCAGATTCCAACGGTACTATTACCGTTGAAGAGCTTAAAAAGCTCCTTGAACAATGG
## AACCTAGTAATAGGTTTCCTATTCCTTACATGGATTTGTCTTCTACAATTTGCCTATGCC
## AACAGGAATAGGTTTTTGTATATAATTAAGTTAATTTTCCTCTGGCTGT
## >GENE_7
## ATGTTTCATCTCGTTGACTTTCAGGTTACTATAGCAGAGATATTACTAATTATTATGAGG
## ACTTTTAAAGTTTCCAT
## >GENE_8
## ATGAAAATTATTCTTTTCTTGGCACTGATAACACTCGCTACTTGTGAGCTTTATCACTAC
## CAAGAGTGTGTTAGAGGTACAACAGTACTTTTAAAAGAACCTTGCTCTTCTGGAACATAC
## GAGGGCAATTCACCATTTCATCCTCTAGCTGATAACAAATTTGCACTGACTTGCTTTAGC
## ACTCA
## >GENE_9
## ATGATTGAACTTTCATTAATTGACTTCTATTTGTGCTTTTTAGCCTTTCTGCTATTCCTT
## GTTTTAATTATGCTTATTATCTTTTGGTTCTCACTTGAACTG
## >GENE_10
## ATGAAATTTCTTGTTTTCTTAGGAATCATCACAACTGTAGCTGCATTTCACCAAGAATGT
## AGTTTACAGTCATGTACTCAACATCAACCATATGTAGTTGATGACCCGTGTCCTATTCAC
## TTCTATTCTAAATGGTATATTAGAGTAGGAGCTAGAAAATC

Replace sequence by regular expression:

bash
seqkit replace --by-seq --pattern "AGC" --replacement "---" tutorial/data/example/genes.fasta
## >GENE-1
## ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCT
## GTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACT
## CACGCAGTATAATTAATAACTAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATC
## TTCTGCAGGCTGCTTACGGTTTCGTCCGTGTTGC---CGATCATC---ACATCTAGGTTT
## CGTCCGGGTGTGACCGAAAGGTAAG
## >GENE-2
## ATGGAG---CTTGTCCCTGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTGCCTGTT
## TTACAGGTTCGCGACGTGCTCGTACGTGGCTTTGGAGACTCCGTGGAGGAGGTCTTATCA
## GAGGCACGTCAACATCTTAAAGATGGCACTTGTGGCTTAGTAGAAGTTGAAAAAGGCGTT
## TTGCCTCAACTTGAAC---CCTATGTGTTCATCAAACGTTCGGATGCTCGAA
## >GENE-3
## ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTCTCTAGTCAGTGTGTTAATCTTACAACC
## AGAACTCAATTACCCCCTGCATACACTAATTCTTTCACACGTGGTGTTTATTACCCTGAC
## AAAGTTTTCAGATCCTCAGTTTTACATTCAACTCAGGACTTGTTCTTACCTTTCTTTTCC
## AATGTTACTTGGTTCCATGCTATACATGTCTCTGGGACCAATGGTACTAAGAGGTTTGAT
## AACCCTGTCCTACCATTTAATGATGGTGTTTATTTTGCTTCCACTGAGAAGTCTAACATA
## ATAAGAGGCTGGATTTTTGGTACTACTTTAGATTC
## >GENE-4
## ATGGATTTGTTTATGAGAATCTTCACAATTGGAACTGTAACTTTGA---AAGGTGAAATC
## AAGGATGCTACTCCTTCAGATTTTGTTCGCGCTACTGCAACGATACCGATACA---CTCA
## CTCCCTTTCGGATGGCTTATTGTTGGCGTTGCACTTCTTGCTG
## >GENE-5
## ATGTACTCATTCGTTTCGGAAGAGACAGGTACGTTAATAGTTAAT---GTACTTCTTTTT
## CTTGCTTTCGTGGTATTCTTGC
## >GENE-6
## ATGGCAGATTCCAACGGTACTATTACCGTTGAAG---TTAAAA---TCCTTGAACAATGG
## AACCTAGTAATAGGTTTCCTATTCCTTACATGGATTTGTCTTCTACAATTTGCCTATGCC
## AACAGGAATAGGTTTTTGTATATAATTAAGTTAATTTTCCTCTGGCTGT
## >GENE-7
## ATGTTTCATCTCGTTGACTTTCAGGTTACTAT---AGAGATATTACTAATTATTATGAGG
## ACTTTTAAAGTTTCCAT
## >GENE-8
## ATGAAAATTATTCTTTTCTTGGCACTGATAACACTCGCTACTTGTG---TTTATCACTAC
## CAAGAGTGTGTTAGAGGTACAACAGTACTTTTAAAAGAACCTTGCTCTTCTGGAACATAC
## GAGGGCAATTCACCATTTCATCCTCT---TGATAACAAATTTGCACTGACTTGCTTT---
## ACTCA
## >GENE-9
## ATGATTGAACTTTCATTAATTGACTTCTATTTGTGCTTTTT---CTTTCTGCTATTCCTT
## GTTTTAATTATGCTTATTATCTTTTGGTTCTCACTTGAACTG
## >GENE-10
## ATGAAATTTCTTGTTTTCTTAGGAATCATCACAACTGT---TGCATTTCACCAAGAATGT
## AGTTTACAGTCATGTACTCAACATCAACCATATGTAGTTGATGACCCGTGTCCTATTCAC
## TTCTATTCTAAATGGTATATTAGAGTAGG---TAGAAAATC

3.7 Order sequence files

3.7.1 Sort sequences

Display help information for sort subcommand:

bash
seqkit sort --help
## sort sequences by id/name/sequence/length.
## 
## By default, all records will be readed into memory.
## For FASTA format, use flag -2 (--two-pass) to reduce memory usage. FASTQ not
## supported.
## 
## Firstly, seqkit reads the sequence head and length information.
## If the file is not plain FASTA file,
## seqkit will write the sequences to temporary files, and create FASTA index.
## 
## Secondly, seqkit sorts sequence by head and length information
## and extracts sequences by FASTA index.
## 
## Usage:
##   seqkit sort [flags]
## 
## Flags:
##   -b, --by-bases                by non-gap bases
##   -l, --by-length               by sequence length
##   -n, --by-name                 by full name instead of just id
##   -s, --by-seq                  by sequence
##   -G, --gap-letters string      gap letters (default "- \t.")
##   -h, --help                    help for sort
##   -i, --ignore-case             ignore case
##   -k, --keep-temp               keep temporary FASTA and .fai file when using 2-pass mode
##   -N, --natural-order           sort in natural order, when sorting by IDs/full name
##   -r, --reverse                 reverse the result
##   -L, --seq-prefix-length int   length of sequence prefix on which seqkit sorts by sequences (0 for whole sequence) (default 10000)
##   -2, --two-pass                two-pass mode read files twice to lower memory usage. (only for FASTA format)
## 
## Global Flags:
##       --alphabet-guess-seq-length int   length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
##       --id-ncbi                         FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
##       --id-regexp string                regular expression for parsing ID (default "^(\\S+)\\s?")
##       --infile-list string              file of input files list (one file per line), if given, they are appended to files from cli arguments
##   -w, --line-width int                  line width when outputting FASTA format (0 for no wrap) (default 60)
##   -o, --out-file string                 out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
##       --quiet                           be quiet and do not show extra information
##   -t, --seq-type string                 sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")
##   -j, --threads int                     number of CPUs. can also set with environment variable SEQKIT_THREADS) (default 4)

Sort file by sequence name:

bash
seqkit sort --by-name tutorial/data/example/genes.fasta
## [INFO] read sequences ...
## [INFO] 10 sequences loaded
## [INFO] sorting ...
## [INFO] output ...
## >GENE-1
## ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCT
## GTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACT
## CACGCAGTATAATTAATAACTAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATC
## TTCTGCAGGCTGCTTACGGTTTCGTCCGTGTTGCAGCCGATCATCAGCACATCTAGGTTT
## CGTCCGGGTGTGACCGAAAGGTAAG
## >GENE-10
## ATGAAATTTCTTGTTTTCTTAGGAATCATCACAACTGTAGCTGCATTTCACCAAGAATGT
## AGTTTACAGTCATGTACTCAACATCAACCATATGTAGTTGATGACCCGTGTCCTATTCAC
## TTCTATTCTAAATGGTATATTAGAGTAGGAGCTAGAAAATC
## >GENE-2
## ATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTGCCTGTT
## TTACAGGTTCGCGACGTGCTCGTACGTGGCTTTGGAGACTCCGTGGAGGAGGTCTTATCA
## GAGGCACGTCAACATCTTAAAGATGGCACTTGTGGCTTAGTAGAAGTTGAAAAAGGCGTT
## TTGCCTCAACTTGAACAGCCCTATGTGTTCATCAAACGTTCGGATGCTCGAA
## >GENE-3
## ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTCTCTAGTCAGTGTGTTAATCTTACAACC
## AGAACTCAATTACCCCCTGCATACACTAATTCTTTCACACGTGGTGTTTATTACCCTGAC
## AAAGTTTTCAGATCCTCAGTTTTACATTCAACTCAGGACTTGTTCTTACCTTTCTTTTCC
## AATGTTACTTGGTTCCATGCTATACATGTCTCTGGGACCAATGGTACTAAGAGGTTTGAT
## AACCCTGTCCTACCATTTAATGATGGTGTTTATTTTGCTTCCACTGAGAAGTCTAACATA
## ATAAGAGGCTGGATTTTTGGTACTACTTTAGATTC
## >GENE-4
## ATGGATTTGTTTATGAGAATCTTCACAATTGGAACTGTAACTTTGAAGCAAGGTGAAATC
## AAGGATGCTACTCCTTCAGATTTTGTTCGCGCTACTGCAACGATACCGATACAAGCCTCA
## CTCCCTTTCGGATGGCTTATTGTTGGCGTTGCACTTCTTGCTG
## >GENE-5
## ATGTACTCATTCGTTTCGGAAGAGACAGGTACGTTAATAGTTAATAGCGTACTTCTTTTT
## CTTGCTTTCGTGGTATTCTTGC
## >GENE-6
## ATGGCAGATTCCAACGGTACTATTACCGTTGAAGAGCTTAAAAAGCTCCTTGAACAATGG
## AACCTAGTAATAGGTTTCCTATTCCTTACATGGATTTGTCTTCTACAATTTGCCTATGCC
## AACAGGAATAGGTTTTTGTATATAATTAAGTTAATTTTCCTCTGGCTGT
## >GENE-7
## ATGTTTCATCTCGTTGACTTTCAGGTTACTATAGCAGAGATATTACTAATTATTATGAGG
## ACTTTTAAAGTTTCCAT
## >GENE-8
## ATGAAAATTATTCTTTTCTTGGCACTGATAACACTCGCTACTTGTGAGCTTTATCACTAC
## CAAGAGTGTGTTAGAGGTACAACAGTACTTTTAAAAGAACCTTGCTCTTCTGGAACATAC
## GAGGGCAATTCACCATTTCATCCTCTAGCTGATAACAAATTTGCACTGACTTGCTTTAGC
## ACTCA
## >GENE-9
## ATGATTGAACTTTCATTAATTGACTTCTATTTGTGCTTTTTAGCCTTTCTGCTATTCCTT
## GTTTTAATTATGCTTATTATCTTTTGGTTCTCACTTGAACTG

Sort file by sequence length:

bash
seqkit sort --by-length tutorial/data/example/genes.fasta
## [INFO] read sequences ...
## [INFO] 10 sequences loaded
## [INFO] sorting ...
## [INFO] output ...
## >GENE-7
## ATGTTTCATCTCGTTGACTTTCAGGTTACTATAGCAGAGATATTACTAATTATTATGAGG
## ACTTTTAAAGTTTCCAT
## >GENE-5
## ATGTACTCATTCGTTTCGGAAGAGACAGGTACGTTAATAGTTAATAGCGTACTTCTTTTT
## CTTGCTTTCGTGGTATTCTTGC
## >GENE-9
## ATGATTGAACTTTCATTAATTGACTTCTATTTGTGCTTTTTAGCCTTTCTGCTATTCCTT
## GTTTTAATTATGCTTATTATCTTTTGGTTCTCACTTGAACTG
## >GENE-10
## ATGAAATTTCTTGTTTTCTTAGGAATCATCACAACTGTAGCTGCATTTCACCAAGAATGT
## AGTTTACAGTCATGTACTCAACATCAACCATATGTAGTTGATGACCCGTGTCCTATTCAC
## TTCTATTCTAAATGGTATATTAGAGTAGGAGCTAGAAAATC
## >GENE-4
## ATGGATTTGTTTATGAGAATCTTCACAATTGGAACTGTAACTTTGAAGCAAGGTGAAATC
## AAGGATGCTACTCCTTCAGATTTTGTTCGCGCTACTGCAACGATACCGATACAAGCCTCA
## CTCCCTTTCGGATGGCTTATTGTTGGCGTTGCACTTCTTGCTG
## >GENE-6
## ATGGCAGATTCCAACGGTACTATTACCGTTGAAGAGCTTAAAAAGCTCCTTGAACAATGG
## AACCTAGTAATAGGTTTCCTATTCCTTACATGGATTTGTCTTCTACAATTTGCCTATGCC
## AACAGGAATAGGTTTTTGTATATAATTAAGTTAATTTTCCTCTGGCTGT
## >GENE-8
## ATGAAAATTATTCTTTTCTTGGCACTGATAACACTCGCTACTTGTGAGCTTTATCACTAC
## CAAGAGTGTGTTAGAGGTACAACAGTACTTTTAAAAGAACCTTGCTCTTCTGGAACATAC
## GAGGGCAATTCACCATTTCATCCTCTAGCTGATAACAAATTTGCACTGACTTGCTTTAGC
## ACTCA
## >GENE-2
## ATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTGCCTGTT
## TTACAGGTTCGCGACGTGCTCGTACGTGGCTTTGGAGACTCCGTGGAGGAGGTCTTATCA
## GAGGCACGTCAACATCTTAAAGATGGCACTTGTGGCTTAGTAGAAGTTGAAAAAGGCGTT
## TTGCCTCAACTTGAACAGCCCTATGTGTTCATCAAACGTTCGGATGCTCGAA
## >GENE-1
## ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCT
## GTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACT
## CACGCAGTATAATTAATAACTAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATC
## TTCTGCAGGCTGCTTACGGTTTCGTCCGTGTTGCAGCCGATCATCAGCACATCTAGGTTT
## CGTCCGGGTGTGACCGAAAGGTAAG
## >GENE-3
## ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTCTCTAGTCAGTGTGTTAATCTTACAACC
## AGAACTCAATTACCCCCTGCATACACTAATTCTTTCACACGTGGTGTTTATTACCCTGAC
## AAAGTTTTCAGATCCTCAGTTTTACATTCAACTCAGGACTTGTTCTTACCTTTCTTTTCC
## AATGTTACTTGGTTCCATGCTATACATGTCTCTGGGACCAATGGTACTAAGAGGTTTGAT
## AACCCTGTCCTACCATTTAATGATGGTGTTTATTTTGCTTCCACTGAGAAGTCTAACATA
## ATAAGAGGCTGGATTTTTGGTACTACTTTAGATTC

3.7.2 Shuffle sequences

Display help information for shuffle subcommand:

bash
seqkit shuffle --help
## shuffle sequences.
## 
## By default, all records will be readed into memory.
## For FASTA format, use flag -2 (--two-pass) to reduce memory usage. FASTQ not
## supported.
## 
## Firstly, seqkit reads the sequence IDs. If the file is not plain FASTA file,
## seqkit will write the sequences to temporary files, and create FASTA index.
## 
## Secondly, seqkit shuffles sequence IDs and extract sequences by FASTA index.
## 
## Usage:
##   seqkit shuffle [flags]
## 
## Flags:
##   -h, --help            help for shuffle
##   -k, --keep-temp       keep temporary FASTA and .fai file when using 2-pass mode
##   -s, --rand-seed int   rand seed for shuffle (default 23)
##   -2, --two-pass        two-pass mode read files twice to lower memory usage. (only for FASTA format)
## 
## Global Flags:
##       --alphabet-guess-seq-length int   length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
##       --id-ncbi                         FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
##       --id-regexp string                regular expression for parsing ID (default "^(\\S+)\\s?")
##       --infile-list string              file of input files list (one file per line), if given, they are appended to files from cli arguments
##   -w, --line-width int                  line width when outputting FASTA format (0 for no wrap) (default 60)
##   -o, --out-file string                 out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
##       --quiet                           be quiet and do not show extra information
##   -t, --seq-type string                 sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")
##   -j, --threads int                     number of CPUs. can also set with environment variable SEQKIT_THREADS) (default 4)

Shuffle sequences:

bash
# set seed to reproduce result
seqkit shuffle --rand-seed 1701 tutorial/data/example/genes.fasta
## [INFO] read sequences ...
## [INFO] 10 sequences loaded
## [INFO] shuffle ...
## [INFO] output ...
## >GENE-5
## ATGTACTCATTCGTTTCGGAAGAGACAGGTACGTTAATAGTTAATAGCGTACTTCTTTTT
## CTTGCTTTCGTGGTATTCTTGC
## >GENE-3
## ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTCTCTAGTCAGTGTGTTAATCTTACAACC
## AGAACTCAATTACCCCCTGCATACACTAATTCTTTCACACGTGGTGTTTATTACCCTGAC
## AAAGTTTTCAGATCCTCAGTTTTACATTCAACTCAGGACTTGTTCTTACCTTTCTTTTCC
## AATGTTACTTGGTTCCATGCTATACATGTCTCTGGGACCAATGGTACTAAGAGGTTTGAT
## AACCCTGTCCTACCATTTAATGATGGTGTTTATTTTGCTTCCACTGAGAAGTCTAACATA
## ATAAGAGGCTGGATTTTTGGTACTACTTTAGATTC
## >GENE-8
## ATGAAAATTATTCTTTTCTTGGCACTGATAACACTCGCTACTTGTGAGCTTTATCACTAC
## CAAGAGTGTGTTAGAGGTACAACAGTACTTTTAAAAGAACCTTGCTCTTCTGGAACATAC
## GAGGGCAATTCACCATTTCATCCTCTAGCTGATAACAAATTTGCACTGACTTGCTTTAGC
## ACTCA
## >GENE-6
## ATGGCAGATTCCAACGGTACTATTACCGTTGAAGAGCTTAAAAAGCTCCTTGAACAATGG
## AACCTAGTAATAGGTTTCCTATTCCTTACATGGATTTGTCTTCTACAATTTGCCTATGCC
## AACAGGAATAGGTTTTTGTATATAATTAAGTTAATTTTCCTCTGGCTGT
## >GENE-9
## ATGATTGAACTTTCATTAATTGACTTCTATTTGTGCTTTTTAGCCTTTCTGCTATTCCTT
## GTTTTAATTATGCTTATTATCTTTTGGTTCTCACTTGAACTG
## >GENE-1
## ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCT
## GTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACT
## CACGCAGTATAATTAATAACTAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATC
## TTCTGCAGGCTGCTTACGGTTTCGTCCGTGTTGCAGCCGATCATCAGCACATCTAGGTTT
## CGTCCGGGTGTGACCGAAAGGTAAG
## >GENE-2
## ATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTGCCTGTT
## TTACAGGTTCGCGACGTGCTCGTACGTGGCTTTGGAGACTCCGTGGAGGAGGTCTTATCA
## GAGGCACGTCAACATCTTAAAGATGGCACTTGTGGCTTAGTAGAAGTTGAAAAAGGCGTT
## TTGCCTCAACTTGAACAGCCCTATGTGTTCATCAAACGTTCGGATGCTCGAA
## >GENE-10
## ATGAAATTTCTTGTTTTCTTAGGAATCATCACAACTGTAGCTGCATTTCACCAAGAATGT
## AGTTTACAGTCATGTACTCAACATCAACCATATGTAGTTGATGACCCGTGTCCTATTCAC
## TTCTATTCTAAATGGTATATTAGAGTAGGAGCTAGAAAATC
## >GENE-4
## ATGGATTTGTTTATGAGAATCTTCACAATTGGAACTGTAACTTTGAAGCAAGGTGAAATC
## AAGGATGCTACTCCTTCAGATTTTGTTCGCGCTACTGCAACGATACCGATACAAGCCTCA
## CTCCCTTTCGGATGGCTTATTGTTGGCGTTGCACTTCTTGCTG
## >GENE-7
## ATGTTTCATCTCGTTGACTTTCAGGTTACTATAGCAGAGATATTACTAATTATTATGAGG
## ACTTTTAAAGTTTCCAT

4 Exercises

The exercies below are designed to strengthen your knowledge of using SeqKit for FASTA/Q parsing. The solution to each problem is blurred, only after attempting to solve the problem yourself should you look at the solution. Should you need any help, please ask one of the instructors.

Create a directory to store the output files from each exercise:

bash
mkdir exercises
mkdir exercises/covid19
mkdir exercises/malaria
mkdir exercises/leishmania

4.1 Coronavirus

Coronavirus disease (COVID-19) is an infectious disease caused by the SARS-CoV-2 virus. Most people infected with the virus will experience mild to moderate respiratory illness and recover without requiring special treatment. However, some will become seriously ill and require medical attention.

The covid19 directory contains the COVID-19 reference genome in FASTA format. This was downloaded from the COVID-19 portal maintained by the Ensembl genome database. Using the SeqKit software, answer the follow questions about the viral genome:

  1. How long is the genome?
bash
# 29,903
seqkit stats tutorial/data/covid19/GCA_009858895.3.fasta > exercises/covid19/genome-length.txt
  1. What is the GC content of the genome?
bash
# 37.97
seqkit fx2tab --name --gc tutorial/data/covid19/GCA_009858895.3.fasta > exercises/covid19/genome-gc.txt
  1. How many “GAGAGA” subsequences are in the genome?
bash
# 12
seqkit locate --pattern "GAGAGA" tutorial/data/covid19/GCA_009858895.3.fasta | tail -n +2 | wc -l > exercises/covid19/genome-pattern.txt
  1. Translate the DNA sequence to its corresponding protein sequence:
bash
seqkit translate tutorial/data/covid19/GCA_009858895.3.fasta | head  > exercises/covid19/genome-protein.txt
  1. How many STOP characters are in the protein sequence?
bash
# 116
seqkit translate tutorial/data/covid19/GCA_009858895.3.fasta | grep --count "*" > exercises/covid19/genome-stop.txt

4.2 Malaria

Plasmodium falciparum is the malarial parasite most dangerous to humans, accounting for over 90% of all malarial deaths, and was the first species of the genus Plasmodium to be sequenced. Its genome is notable for an exceptionally low GC content of under 20%. In other respects, the genome has a similar size (23.3 Mb) and gene count (about 5300) to other species of malaria.

The malaria directory contains a FASTA file of all the cDNA sequences in the parasite genome. The file was downloaded from the Plasmodium falciparum portal maintained by the Ensembl genome database. Using the SeqKit software, answer the follow questions about the parasite genes:

  1. How many genes are in the genome?
bash
# 5,515
seqkit stats tutorial/data/malaria/Plasmodium_falciparum.ASM276v2.cdna.all.fa > exercises/malaria/total-genes.txt
  1. What is the name of the longest gene?
bash
# PF3D7_0628100
seqkit sort --by-length --reverse tutorial/data/malaria/Plasmodium_falciparum.ASM276v2.cdna.all.fa | seqkit head -n 1 | seqkit seq --name > exercises/malaria/longest-gene.txt
## [INFO] read sequences ...
## [INFO] 5515 sequences loaded
## [INFO] sorting ...
## [INFO] output ...
  1. What is the range of GC content across all genes? Use the watch subcommand to help you answer this question.
bash
# 14.90 to 50.98
seqkit watch --fields GC --img exercises/malaria/genes-gc.png tutorial/data/malaria/Plasmodium_falciparum.ASM276v2.cdna.all.fa
##                                       GC content                                
## 0.113┤          _                                
## 0.110┤       _  █                                
## 0.101┤       █ _█                                
## 0.093┤       █_██_                               
## 0.081┤       █████_                              
##      │       ██████                              
## 0.072┤      _██████_                             
## 0.062┤      ████████__                           
## 0.051┤     _██████████                           
## 0.046┤     ███████████__                         
##      │     █████████████                         
## 0.026┤    _█████████████__                       
## 0.022┤    ████████████████___                    
## 0.011┤   _███████████████████___                 
##      │──────────────────────────────────────────────────────────────────────────
##      │1111111222222222223333333333334444444444445                         
##      │4567899012345567890012345567890012345567890                         
##      │...........................................                         
##      │9764319864310865320875420976421976431986531                         
##      │Count: 5515 Mean: 25.03 Stdev: 4.18 Min: 14.90 Max: 50.98 Precision: 2 Bins: 43
  1. Extract the name and sequence of the gene which causes resistance to the antimalarial drug chloroquine:
bash
# Search sequences mentioning "chloroquine" and extract the sequence by name
grep "chloroquine" tutorial/data/malaria/Plasmodium_falciparum.ASM276v2.cdna.all.fa
seqkit grep --pattern "CAD50842" tutorial/data/malaria/Plasmodium_falciparum.ASM276v2.cdna.all.fa > exercises/malaria/drug-gene.txt
## >CAD50842 cdna chromosome:ASM276v2:7:403222:406317:1 gene:PF3D7_0709000 gene_biotype:protein_coding transcript_biotype:protein_coding description:chloroquine resistance transporter
  1. Extract subsequences of the chloroquine resistance gene in sliding windows (step size = 1, window size = 10) and produce a histogram of the GC content:
bash
seqkit grep --pattern "CAD50842" tutorial/data/malaria/Plasmodium_falciparum.ASM276v2.cdna.all.fa | seqkit sliding --step 1 --window 10 | seqkit watch --fields GC --img exercises/malaria/gc-hist.png
##                                       GC content                                
## 0.041┤                        ______                                    
## 0.040┤                  ______│████│                                    
##      │                  │████││████│                                    
##      │                  │████││████│                                    
##      │                  │████││████│                                    
## 0.029┤                  │████││████│      ______                        
##      │                  │████││████│      │████│                        
## 0.022┤      ______      │████││████│      │████│                        
##      │      │████│      │████││████│      │████│                        
##      │      │████│      │████││████│      │████│                        
## 0.014┤      │████│      │████││████│      │████│______                  
##      │      │████│      │████││████│      │████││████│                  
## 0.006┤______│████│      │████││████│      │████││████│                  
## 0.006┤│████││████│      │████││████│      │████││████│      ______      
##      │──────────────────────────────────────────────────────────────────────────
##      │0     6     1     1     2     3     3     4     5     5     6       
##      │.     .     2     9     5     1     8     4     0     7     3       
##      │0     3     .     .     .     .     .     .     .     .     .       
##      │0     6     7     1     5     8     2     5     9     3     6       
##      │Count: 1266 Mean: 28.4 Stdev: 14.3 Min: 0.0 Max: 70.0 Precision: 1 Bins: 11

4.3 Leishmaniasis

Leishmania is a Tryanosomatid protozoa and is the parasite responsible for the disease Leishmaniasis. Leishmania is transmitted by the bite of certain species of sand fly and affects the populations of 88 tropical and sub-tropical countries worldwide. The symptoms are cutaneous and muco-cutaneous lesions initially around the bite, the parasite can also migrate causing visceral leishmaniasis affecting the haemopoietic organs.

The leishmania directory contains the Leishmania major reference genome and gene sequences in FASTA format. The file was downloaded from the Plasmodium falciparum portal maintained by the Ensembl genome database. Using the SeqKit software, answer the follow questions about the parasite genome:

  1. What is the total number of polar amino acids (SCNBQZTY) in the protein sequences?
bash
# 2,706
seqkit translate tutorial/data/leishmania/Leishmania_major.ASM272v2.cdna.all.fa | grep -c "[SCNBQZTY]" > exercises/leishmania/total-polar.txt
  1. How many genes have duplicate sequences and what are their names?
bash
# 2 CBZ11846, CBZ11847
# 2 CBZ11883, CBZ11882
seqkit rmdup --by-seq --ignore-case --dup-num-file exercises/leishmania/dupnum.txt --out-file exercises/leishmania/dupfile.txt tutorial/data/leishmania/Leishmania_major.ASM272v2.cdna.all.fa
cat exercises/leishmania/dupnum.txt
## [INFO] 2 duplicated records removed
## 2    CBZ11846, CBZ11847
## 2    CBZ11883, CBZ11882
  1. What is the name of the 28th longest gene?
bash
# CBZ11909
seqkit sort --by-length --reverse tutorial/data/leishmania/Leishmania_major.ASM272v2.cdna.all.fa | seqkit range --range 28:28 | seqkit seq --name > exercises/leishmania/name-28-long.txt
## [INFO] read sequences ...
## [INFO] 202 sequences loaded
## [INFO] sorting ...
## [INFO] output ...
  1. How many SaCas9 PAM sites (NGRRT or NGRRN) are present on chromosome 1?
bash
# 38,139
seqkit grep --pattern "1" tutorial/data/leishmania/Leishmania_major.ASM272v2.dna.toplevel.fa | seqkit locate --pattern "NGRRT|NGRRN" --degenerate | tail -n +2 | wc -l > exercises/leishmania/cas9-sites.txt
  1. What is the average GC content of the regions 20 bases upstream of the SaCas9 PAM sites?
bash
# 63.1
seqkit grep --pattern "1" tutorial/data/leishmania/Leishmania_major.ASM272v2.dna.toplevel.fa | seqkit locate --bed --pattern "NGRRT|NGRRN" --degenerate > exercises/leishmania/cas9-sites.bed
seqkit subseq --bed exercises/leishmania/cas9-sites.bed --only-flank --up-stream 20 tutorial/data/leishmania/Leishmania_major.ASM272v2.dna.toplevel.fa | seqkit watch --fields GC --img exercises/leishmania/cas9-sites-gc.png
## [INFO] read BED file ...
## [INFO] 38139 BED features loaded
## [INFO] create FASTA index for tutorial/data/leishmania/Leishmania_major.ASM272v2.dna.toplevel.fa
##                                       GC content                                
## 0.074┤                         _              
##      │                         █              
## 0.065┤                         █ _            
##      │                         █ █            
##      │                         █ █            
## 0.049┤                       _ █ █            
## 0.044┤                      _█ █ █ _          
##      │                      ██ █ █ █          
## 0.034┤                   _  ██ █ █ █          
##      │                   █  ██ █ █ █          
## 0.024┤                   █  ██ █ █ █ _        
## 0.017┤                 _ █  ██_█ █ █ █        
##      │                 █ █  ████ █ █ █        
## 0.008┤               _ █ █ _████ █ █ █        
##      │──────────────────────────────────────────────────────────────────────────
##      │0257111122223333444455556666777788889999                            
##      │....025702570257025702570257025702570257                            
##      │0505....................................                            
##      │0000050505050505050505050505050505050505                            
##      │Count: 38139 Mean: 63.1 Stdev: 11.3 Min: 0.0 Max: 100.0 Precision: 1 Bins: 40